Using Case-Based Reasoning for Spam Filtering
Abstract
Spam is a universal problem with which everyone is familiar. Figures published in 2005 state that about 75% of all email sent today is spam. In spite of significant new legal and technical approaches to combat it, spam remains a big problem that is costing companies meaningful amounts of money in lost productivity, clogged email systems, bandwidth and technical support. A number of approaches are used to combat spam, including legislative measures, authentication approaches and email filtering. The most common filtering technique is content-based filtering, which uses the actual text of the message to determine whether it is spam or not. One of the main challenges of content-based spam filtering is concept drift: the concept, that is, the set of characteristics the filter uses to identify spam email, changes constantly over time. Concept drift is very evident in email and spam, in part due to the arms race that exists between the spammers and the filter producers. The spammers continually change the content and structure of the spam emails as the filters are modified to catch them. In this thesis we present Email Classification Using Examples (ECUE), a content-based approach to spam filtering that can handle the concept drift inherent in spam email. We apply the machine learning technique of case-based reasoning, which models the emails as cases in a knowledge-base or case-base. The approach used in ECUE involves two components: a case-base editing stage and a case-base update policy. We present a new technique for case-base editing called Competence-Based Editing, which uses the competence properties of the cases in the case-base to determine which cases are harmful to the predictive power of the case-base and should be removed. The update policy allows new examples of spam and legitimate emails to be added to the case-base as they are encountered, allowing ECUE to track the concept drift.
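As a rough illustration of the general idea of a case-base with an update policy, the following minimal sketch stores emails as cases and classifies a new email by majority vote among its nearest neighbours. It is not the ECUE implementation: the token-set representation, the Jaccard similarity, the `CaseBase` class and the "add only misclassified emails" rule are all assumptions made for this example, and Competence-Based Editing is not shown.

```python
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its set of whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class CaseBase:
    """Toy case-base (hypothetical): each case is a (token_set, label) pair."""

    def __init__(self, k=3):
        self.k = k
        self.cases = []

    def add(self, text, label):
        """Store a labelled email as a new case."""
        self.cases.append((tokenize(text), label))

    def classify(self, text):
        """Return the majority label among the k most similar stored cases."""
        if not self.cases:
            raise ValueError("case-base is empty")
        query = tokenize(text)
        ranked = sorted(self.cases, key=lambda c: jaccard(query, c[0]), reverse=True)
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]

    def update(self, text, true_label):
        """Assumed update policy: keep a new case only if it was misclassified."""
        if self.classify(text) != true_label:
            self.add(text, true_label)
            return True
        return False


cb = CaseBase(k=1)
cb.add("cheap pills buy now", "spam")
cb.add("meeting agenda for monday", "ham")
print(cb.classify("buy cheap pills today"))
```

Because learning is lazy, tracking drift in this sketch amounts to appending cases; no model is retrained when a new kind of spam appears.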
We compare the case-based approach to an ensemble approach, which is a more standard technique for handling concept drift, and present a prototype email filtering applica-
Similar resources
A Case-Based Approach to Spam Filtering that Can Track Concept Drift
There are a few key benefits of a case-based approach to spam filtering. First, the many different sub-types of spam suggest that a local learner, such as Case-Based Reasoning (CBR), will perform well. Second, the lazy approach to learning in CBR allows for easy updating as new types of spam arrive. Third, the case-based approach to spam filtering allows for the sharing of cases and thus a shari...
Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering
In this paper, we compare case-based spam filters, focusing on their resilience to concept drift. In particular, we evaluate how to track concept drift using a case-based spam filter that uses a feature-free distance measure based on text compression. In our experiments, we compare two ways to normalise such a distance measure, finding that the one proposed in [1] performs better. We show that a...
Differential Voting in Case Based Spam Filtering
Case-based reasoning (CBR) has been shown to be of considerable utility in a spam-filtering task. In the course of this study, we propose that the non-random skewed distribution of the cases in a case base is crucial, especially in the context of a classification task like spam filtering. In this paper, we propose approaches to improve the performance of a CBR spam filter by making use of the n...
SpamHunting: An instance-based reasoning system for spam labelling and filtering
In this paper we show an instance-based reasoning e-mail filtering model that outperforms classical machine learning techniques and other successful lazy learning approaches in the domain of anti-spam filtering. The architecture of the learning-based anti-spam filter is based on a tuneable enhanced instance retrieval network able to accurately generalize e-mail representations. The reuse of sim...
Two Approaches on Implementation of CBR and CRM Technologies to the Spam Filtering Problem
Recently, the number of unwanted messages arriving by e-mail has increased sharply. Because spam changes in character, anti-spam systems should be trainable and dynamic. Machine learning has long been applied successfully to filtering unwanted e-mail messages. This paper proposes applying Case Based Reasoning technology to the spam filtering problem....